On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases
Abstract
For a wide variety of classification algorithms, scalability to large databases can be achieved by observing that most algorithms are driven by a set of sufficient statistics that are significantly smaller than the data. By relying on a SQL backend to compute the sufficient statistics, we leverage the query processing system of SQL databases and avoid the need to move data to the client. We present a new SQL operator (Unpivot) that enables efficient gathering of statistics with minimal changes to the SQL backend. Our approach results in a significant increase in performance without requiring any changes to the physical layout of the data. We show analytically how this approach outperforms an alternative that requires changing the data layout. We also compare the effect of data representation and show that a “dense” representation may be preferred to a “sparse” one, even when the data are fairly sparse.
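As a rough illustration of the idea in the abstract, the sketch below gathers class-conditional counts inside the database, so that only the small count table reaches the client. It rests on several assumptions that are not in the source: the sufficient statistics are taken to be co-occurrence counts of (attribute, value, class), the table `cases` and its columns are hypothetical, and the unpivot step is emulated with a portable `UNION ALL` in SQLite. It does not reproduce the proposed Unpivot operator's syntax or its performance characteristics.

```python
import sqlite3

# Hypothetical toy table: two categorical attributes and a class label.
rows = [
    ("sunny", "high", "no"),
    ("sunny", "low", "yes"),
    ("rainy", "high", "no"),
    ("rainy", "low", "yes"),
    ("sunny", "high", "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (outlook TEXT, humidity TEXT, label TEXT)")
conn.executemany("INSERT INTO cases VALUES (?, ?, ?)", rows)

# Emulated unpivot (an assumption, not the paper's operator): turn each
# wide row into one (attr, val, label) row per attribute, then count
# co-occurrences with a single GROUP BY on the server side.
query = """
SELECT attr, val, label, COUNT(*) AS n
FROM (
    SELECT 'outlook' AS attr, outlook AS val, label FROM cases
    UNION ALL
    SELECT 'humidity' AS attr, humidity AS val, label FROM cases
)
GROUP BY attr, val, label
ORDER BY attr, val, label
"""

# Only the aggregated counts are transferred to the client.
counts = {(attr, val, label): n for attr, val, label, n in conn.execute(query)}
conn.close()

for key, n in sorted(counts.items()):
    print(key, n)
```

The resulting dictionary has one entry per distinct (attribute, value, class) combination, which is typically far smaller than the case table itself; a client-side classifier such as naive Bayes or a decision-tree learner could be driven from counts of this kind.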
Similar resources
Green envelopes classification: the comparative analysis of efficient factors on the thermal and energy performance of green envelopes
This paper classifies green envelopes as green roofs and green walls according to effective factors, which were derived from literature to compare the green envelopes’ thermal and energy performance in a more effective way. For this purpose, an extensive literature review was carried out by searching keywords in databases and studying related journal papers and articles. The research meth...
An algorithm for the anchor points of the PPS of the CCR model
Anchor DMUs are a new class in the general classification of Decision Making Units (DMUs) in Data Envelopment Analysis (DEA). An anchor DMU in DEA is an extreme-efficient DMU that defines the transition from the efficient frontier to the free-disposability part of the boundary of the Production Possibility Set (PPS). In this paper, the anchor points of the PPS of the CCR model are investigated....
Dimensionality reduction of hyperspectral data to increase class separability and preserve data structure
By gathering hundreds of spectral bands from the surface of the Earth, hyperspectral imaging allows us to separate materials with similar spectra. Hyperspectral images can be used in many applications such as land chemical and physical parameter estimation, classification, target detection, unmixing, and so on. Among these applications, classification is of particular interest. A hyperspectral im...
Querying Hierarchical Data in Very Large Databases
Hierarchical data, such as a Partially Ordered Set (POSET), is widely used in relational databases, especially in data mining and data warehouse-based applications. Unfortunately, SQL (Structured Query Language) does not effectively support hierarchical data structures for managing this sort of data. For example, in Oracle, a CONNECT BY operator is used to query data organized into trees, howeve...
3D Detection of Power-Transmission Lines in Point Clouds Using Random Forest Method
Inspection of power transmission lines using classic expert-based methods suffers from disadvantages such as high time and money consumption. The advent of UAVs and their application in aerial data gathering helps to decrease the time and cost significantly. The purpose of this research is to present an efficient automated method for inspection of power transmission lines based on point c...